For web scraping engineers, data cleaning and storage are the final and often most tedious steps in the workflow. In enterprises that scrape thousands of websites, this work frequently becomes a dedicated role: the data cleaning specialist.
Here are the 10 most efficient data cleaning techniques from my daily scraping practice:
1. XPath
XPath is my most frequently used HTML parsing method; mastering it solves over 90% of scraping data cleaning challenges.
Use Case: When scraped data is embedded in HTML code.
Example: Extracting Fortune Global 500 company data (2024 rankings):
import requests
from parsel import Selector

url = "https://www.fortunechina.com/fortune500/c/2024-08/05/content_456697.htm"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
response.encoding = 'utf8'  # the page is served as UTF-8 Chinese text
selector = Selector(text=response.text)

# One <tr> per company; note that /tbody/ only matches if the server-sent HTML
# actually contains a <tbody> (browsers insert one automatically, so an XPath
# copied from DevTools sometimes needs it removed)
companies = selector.xpath('//div[@class="hf-right word-img2"]/div[@class="word-table"]/div[@class="wt-table-wrap"]/table/tbody/tr')
for company in companies:
    rank = company.xpath('./td[1]/text()').get()
    name = company.xpath('./td[2]/a/text()').get()
    revenue = company.xpath('./td[3]/text()').get()
    profit = company.xpath('./td[4]/text()').get()
    country = company.xpath('./td[5]/text()').get()
    print(rank, name, revenue, profit, country)
Key XPath Syntax (demonstrated in the sketch below):
- Node selection: //div, x/div, div/text()
- Predicates: div[1], div[last()], div[@class="example"]
- Axes: ancestor::, following-sibling::
- Fuzzy matching: contains(@href, "example.com")
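A minimal sketch illustrating these constructs, using an invented HTML fragment (the markup and values below are made up purely for demonstration):
from parsel import Selector

# Invented fragment to exercise the syntax listed above
html = '''
<div id="links">
  <a href="https://example.com/a">First</a>
  <a href="https://other.org/b">Second</a>
  <a href="https://example.com/c">Third</a>
</div>
'''
sel = Selector(text=html)
print(sel.xpath('//a[1]/text()').get())                           # predicate: 'First'
print(sel.xpath('//a[last()]/text()').get())                      # predicate: 'Third'
print(sel.xpath('//a[1]/following-sibling::a[1]/text()').get())   # axis: 'Second'
print(sel.xpath('//a[contains(@href, "example.com")]/text()').getall())  # fuzzy: ['First', 'Third']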
2. Pandas read_html
For tabular data in HTML, Pandas offers a one-line solution:
import pandas as pd
from io import StringIO
from sqlalchemy import create_engine

# read_html parses every <table> on the page; [0] takes the first match
df = pd.read_html(StringIO(response.text))[0]
print(df.head())

# Export options (to_excel needs openpyxl; to_sql needs SQLAlchemy plus a MySQL driver)
df.to_excel('fortune500.xlsx')
df.to_sql('fortune500', create_engine('mysql+pymysql://user:pass@localhost/db'))

# Quick analysis
print(df['Country'].value_counts())
Pro Tip: When facing IP blocks, route requests through a proxy; for sustained scraping, rotate between several (see the sketch below):
proxy = "http://user:pass@proxy_ip:port"  # placeholder credentials and address
response = requests.get(url, proxies={'http': proxy, 'https': proxy})
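A single proxy only delays the block; actual rotation cycles through a pool. A minimal sketch, assuming a list of placeholder endpoints:
import itertools
import requests

# Placeholder endpoints; substitute real proxy credentials and addresses
proxy_pool = itertools.cycle([
    "http://user:pass@proxy_ip1:port",
    "http://user:pass@proxy_ip2:port",
])

def fetch(url):
    proxy = next(proxy_pool)  # next proxy in round-robin order
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)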
[Continued in next part…]
These methods form the core toolkit for efficient web data extraction and transformation. The complete code examples are available in the [GitHub repository].